Abstract


Independent checkpointing allows maximum process autonomy but suffers from potential domino 
effects. Coordinated checkpointing eliminates the domino effect by sacrificing a certain degree of 
process autonomy. In this paper, we propose the technique of lazy checkpoint coordination which 
preserves process autonomy while employing communication-induced checkpoint coordination for 
bounding rollback propagation. The introduction of the notion of laziness allows a flexible trade- 
off between the cost for checkpoint coordination and the average rollback distance. Worst-case 
overhead analysis provides a means for estimating the extra checkpoint overhead. Communication 
trace-driven simulation for several parallel programs is used to evaluate the benefits of the proposed 
scheme for real applications. 
1 Introduction


Independent (or uncoordinated) checkpointing [1-3] for parallel and distributed systems allows
maximum process autonomy and independent design of recovery capability for each process. How- 
ever, since the rollback of a message sender requires the sympathetic rollback [4] of the receiver, 
the domino effect [5] is in general possible unless certain mechanisms are incorporated into the 
checkpointing and recovery protocol to guarantee recovery line [6] progression. Existing techniques 
for achieving domino-free rollback recovery can be classified into two primary categories [7]. The 
first category can be called the minimum sympathetic rollback approach in which either the rollback 
of a process will never undo any messages sent or the receiver of an undone message M will try to 
roll back to the state immediately before receiving M. Wu and Fuchs [8] insert a checkpoint imme- 
diately after each message is sent so that no sympathetic rollback is necessary for any failure. Kim 
et al. [9,10] and Venkatesh et al. [11] employ dependency tracking and insert extra checkpoints 
before processing any messages that result in new dependency. The state-interval based approach 
[12-21] models the program execution as consisting of a number of state intervals, each started by 
processing a new message. Message logging in addition to checkpointing is employed to effectively 
insert a “checkpoint” (in the optimized form of a message log) before each message receipt.

The second category can be called the bounded rollback propagation approach. Corresponding 
checkpoints (based on the ordinal numbers) on different processes are required to coordinate with 
each other in order to form a recovery line to bound the possible rollback propagation. Usually, 
whenever a checkpoint is initiated by one process, all other processes are informed and required 
to take appropriate checkpoints to guarantee the resulting set of checkpoints is consistent [22-27]. 
The number of processes required to participate in each checkpointing session can be reduced by 
monitoring the recent message exchanging history [28]. For systems with clock synchronization 
and/or bounded message transmission delay, the cost for checkpoint coordination can be further 
reduced [29-32]. 

We will use the term eager checkpoint coordination for the coordination action performed when 
checkpoints are initiated (as described above). In contrast, processes in a system with lazy checkpoint
coordination only coordinate their corresponding checkpoints when the message communication
indicates a violation of checkpoint consistency². Briatico et al. [35] force the receiver of a
message M to take a checkpoint before processing M if the sender’s checkpoint ordinal number 
tagged on M is greater than that of the receiver. Checkpoints with the same ordinal numbers 
are therefore always guaranteed to be consistent. However, the run-time overhead may be pro- 
hibitively high due to the possibly excessive number of extra induced checkpoints. In this paper, 
we generalize the concept of communication-induced checkpoint coordination by introducing the 
notion of laziness Z as a measure of the frequency for performing coordination. Only corresponding 
checkpoints with ordinal numbers nZ, where n is an integer, are required to be consistent with each
other and form the recovery line for bounding rollback propagation. Overhead analysis and exper- 
imental evaluation show that lazy checkpoint coordination can significantly reduce the number of 
extra checkpoints and offer a flexible trade-off between run-time overhead versus average rollback 
distance. 

The paper is organized as follows. Section 2 describes the system model and the checkpointing 
and recovery protocol; Section 3 gives the motivation and the algorithm for lazy checkpoint
coordination; worst-case overhead analysis is presented in Section 4; and the trace-driven simulation
results for several parallel programs are discussed in Section 5.


2 Checkpointing and Rollback Recovery 

The system considered in this paper consists of a number of concurrent processes for which 
all process communication is through message passing. Processes are assumed to run on fail-stop 
processors [36] and each processor is considered as an individual recovery unit [15]. We do not 
assume the piecewise deterministic execution model [20]. 

² The basic idea motivating the lazy checkpoint coordination is similar to the concepts behind the lazy release consistency in distributed shared memory [33] and the lazy message cancellation in optimistic distributed simulation systems [34].

During normal execution, the state of each processor is periodically saved as a checkpoint on stable storage. Let $CP_{i,k}$ denote the kth checkpoint of processor $p_i$ with $k \ge 0$ and $0 \le i \le N-1$, where N is the number of processors. A checkpoint interval is defined to be the time between two consecutive checkpoints on the same processor, and the interval between $CP_{i,k}$ and $CP_{i,k+1}$ is called the kth checkpoint interval. Each message is tagged with the current checkpoint ordinal number and the processor number of the sender. Each processor takes its checkpoints independently and updates the direct dependency information table (or input table [2]) as follows: if at least one message from the mth checkpoint interval of processor $p_j$ has been processed during the previous checkpoint interval, the pair (j, m) is added to the table entry for the new checkpoint.
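
The tagging and table-update rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DependencyTracker` and its methods are hypothetical, and writing the checkpoint itself to stable storage is omitted.

```python
class DependencyTracker:
    """Per-processor bookkeeping for independent checkpointing (illustrative only)."""

    def __init__(self, pid):
        self.pid = pid
        self.ordinal = 0               # ordinal number of the latest checkpoint
        self.pending = set()           # (j, m) pairs seen in the current interval
        self.input_table = {0: set()}  # checkpoint ordinal -> direct dependencies

    def tag(self):
        """Tag carried by every outgoing message: (sender number, sender ordinal)."""
        return (self.pid, self.ordinal)

    def process(self, tag):
        """Record that a message from interval m of processor j was processed."""
        self.pending.add(tag)

    def take_checkpoint(self):
        """Take the next checkpoint and file the dependencies of the interval just ended."""
        self.ordinal += 1
        self.input_table[self.ordinal] = self.pending
        self.pending = set()
        return self.ordinal
```

Because the pending entries form a set, processing several messages from the same sender interval adds the pair (j, m) only once, which matches the "at least one message" wording.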

A centralized garbage collection algorithm [37] can be periodically invoked by any processor. 
First, the dependency information for all existing checkpoints is collected to construct the checkpoint 
graph [1] (Fig. 1(b)). All checkpoints corresponding to the vertices marked “X” in Fig. 1(b) are
determined to be garbage by the algorithm and can therefore be discarded. 

When processor $p_i$ initiates a rollback, it sends out a rollback-initiating message [2] to every other processor to request the up-to-date dependency information. Each surviving processor takes a virtual checkpoint (represented by the dotted vertex in Fig. 1(c)) upon receiving the rollback-initiating message. After receiving the responses, $p_i$ constructs the extended checkpoint graph [1] and executes the rollback propagation algorithm shown in Fig. 2 to determine the recovery line (the shaded vertices in Fig. 1(c)). A rollback-request message is then broadcast to roll back each processor according to the recovery line (Fig. 1(d)).

There are two primary checkpoint consistency situations. In Fig. 3(a), the checkpoints $CP_{i,k}$ and $CP_{j,m}$ are inconsistent because of the orphan message [31] $M_a$. In Fig. 3(b), $CP_{i,k}$ and $CP_{j,m}$ can become consistent if the channel-state message [24] $M_b$ is properly recorded. In this paper, we assume either every message is synchronously logged³ [12, 14] or an end-to-end transmission protocol can guarantee the redelivery of the lost channel-state messages [28]. Therefore, checkpoints like $CP_{i,k}$ and $CP_{j,m}$ in Fig. 3(b) are considered consistent.

3 Lazy Checkpoint Coordination 


3.1 Motivation 

We will refer to the checkpoints initiated independently by each processor as basic checkpoints and
those triggered by the communication as induced checkpoints. Fig. 4(a) illustrates the situation

³ Discussions on incorporating an asynchronous logging protocol into the independent checkpointing scheme described in this section can be found in [3].
Figure 1: Checkpointing and rollback recovery: (a) the checkpoint and communication pattern; (b)
checkpoint graph for garbage collection; (c) extended checkpoint graph when $p_0$ initiates the rollback;
(d) checkpoint graph after recovery.


/* CP stands for checkpoint */
/* Initially, all the CPs are unmarked */
Include the latest CP of each processor in the root set;
Mark all CPs strictly reachable from any CP in the root set;
While (at least one CP in the root set is marked) {
    Replace each marked CP in the root set by the latest unmarked CP on the same processor;
    Mark all CPs strictly reachable from any CP in the root set;
}
The root set is the recovery line.


Figure 2: The rollback propagation algorithm. 
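
The algorithm of Fig. 2 can be turned into a short runnable sketch. The graph encoding here is an assumption made for illustration, not taken from the paper: a vertex is a pair `(i, k)` for the kth checkpoint of processor i, `graph` maps each vertex to its successors, and an edge `(i, k) -> (j, m)` is read as "rolling processor i back to checkpoint k invalidates checkpoint m of processor j". The function names are hypothetical.

```python
from collections import deque


def strictly_reachable(graph, roots):
    """Return every vertex reachable from some root by a path of length >= 1."""
    seen = set()
    queue = deque()
    for root in roots:
        queue.extend(graph.get(root, ()))
    while queue:
        cp = queue.popleft()
        if cp not in seen:
            seen.add(cp)
            queue.extend(graph.get(cp, ()))
    return seen


def recovery_line(graph, num_checkpoints):
    """Rollback propagation: num_checkpoints[i] checkpoints exist on processor i,
    numbered 0 .. num_checkpoints[i] - 1. Returns {processor: checkpoint ordinal}."""
    # Start from the latest checkpoint of each processor.
    root = {i: n - 1 for i, n in enumerate(num_checkpoints)}
    marked = strictly_reachable(graph, root.items())
    while any(cp in marked for cp in root.items()):
        for i in list(root):
            k = root[i]
            # Replace a marked root by the latest unmarked checkpoint of processor i.
            while (i, k) in marked:
                k -= 1
                if k < 0:
                    raise ValueError("initial checkpoint is marked; graph is malformed")
            root[i] = k
        marked |= strictly_reachable(graph, root.items())
    return root
```

Marks are accumulated rather than recomputed, which is equivalent as long as the graph contains the edge from each checkpoint to the next one on the same processor.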
Figure 3: Checkpoint consistency: (a) orphan message; (b) channel-state message.

where the communication pattern renders most of the basic checkpoints useless for rollback recovery and the only recovery line is at the very beginning of the execution. A straightforward way of avoiding such unbounded rollback propagation is to perform eager checkpoint coordination as shown in Fig. 4(b). Whenever a processor initiates a basic checkpoint, coordination messages (dotted arrows) are broadcast to all other processors to request their cooperation in making a consistent set of checkpoints [23]. Let B be the total number of basic checkpoints and I be the total number of induced checkpoints. We define the induction ratio $\mathcal{R}$ as

$$\mathcal{R} = I / B$$

which is a measure of the overhead for performing communication-induced checkpoint coordination. Clearly, eager checkpoint coordination has $\mathcal{R} = N - 1$ and will result in large run-time overhead when N is large. In addition, the N - 1 coordination messages per checkpoint session constitute another overhead.
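
As a worked instance of the induction ratio (total induced checkpoints I divided by total basic checkpoints B), the eager case can be written out as follows. The value N = 8 is only an illustrative choice.

```latex
% Eager coordination: each basic checkpoint induces one checkpoint on each of
% the other N-1 processors, so I = (N-1) B.
\[
  \mathcal{R} \;=\; \frac{I}{B} \;=\; \frac{(N-1)\,B}{B} \;=\; N-1,
  \qquad \text{e.g. } N = 8 \;\Rightarrow\; \mathcal{R} = 7 .
\]
```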

The large overhead of eager checkpoint coordination results from its pessimistic nature. More specifically, when $p_1$ in Fig. 4(b) initiates its first basic checkpoint $b_{1,1}$⁴, it “pessimistically” assumes that messages like $M_1$ will exist in the future and cause $b_{1,1}$ to be inconsistent with its corresponding checkpoint $b_{0,1}$ on $p_0$. In order to guarantee that $b_{1,1}$ belongs to a useful recovery line, $p_1$ “eagerly” requests $p_0$’s cooperation at the time $b_{1,1}$ is initiated. In contrast, lazy checkpoint coordination adopts an optimistic approach by assuming that $b_{0,1}$ will be consistent with $b_{1,1}$. If the assumption turns out to be true, no explicit coordination is necessary. An extra checkpoint will be induced on $p_0$ only when the message $M_1$ indicates that the assumption has failed (Fig. 4(c)). From another point of view, such a scheme “lazily” delays the broadcast of the coordination messages and implicitly piggybacks them on future normal messages. Both checkpoint and message overhead can therefore be reduced.

⁴ $b_{i,k}$ denotes the kth basic checkpoint of $p_i$ and $CP_{i,k}$ denotes the kth checkpoint of $p_i$.

Figure 4: Communication-induced checkpointing: (a) the checkpoint and communication pattern; (b) eager checkpoint coordination; (c) lazy checkpoint coordination with laziness = 1; (d) lazy checkpoint coordination with laziness = 2. (Legend: basic checkpoint, induced checkpoint.)

However, given a basic checkpoint pattern, the number of induced checkpoints in the above scheme is determined by the communication pattern and is not otherwise controllable. In the worst case, the induction ratio $\mathcal{R}$ can still be N - 1 as illustrated in Fig. 4(c). In order to further reduce the overhead, we can perform even “lazier” coordination by only enforcing the consistency between checkpoints $CP_{0,nZ}$ and $CP_{1,nZ}$, where Z is called the laziness and n is an integer. Fig. 4(d) shows the case with Z = 2. No checkpoint is induced until the message $M_2$ indicates the inconsistency between $b_{1,2}$ and $b_{0,2}$. The number of induced checkpoints can be reduced from 8 (Fig. 4(c) with Z = 1) to 2 at the cost of a potentially larger rollback distance. It becomes clear that lazy checkpoint coordination can provide a trade-off between the checkpointing overhead and the average rollback distance.

3.2 The Protocol 

Our approach is to incorporate lazy checkpoint coordination into the independent checkpointing scheme as a mechanism for bounding rollback propagation. Therefore, the checkpointing and rollback recovery protocol can be built on top of the one described in Section 2. During normal execution, each processor still takes its basic checkpoints independently. The laziness Z is a predetermined system parameter known to all processors. Suppose a processor $p_j$ with current checkpoint ordinal number r is about to process a message M tagged with sender $p_i$’s ordinal number s. If $p_j$ detects the following condition to be true

$$l = \lfloor s/Z \rfloor > \lfloor r/Z \rfloor,$$

it realizes that $CP_{i,lZ}$ and $CP_{j,lZ}$ will be inconsistent unless an extra checkpoint is induced before M is processed. We describe a possible implementation as follows. Each processor $p_j$ maintains a variable V which is initialized to be Z and incremented by Z each time $CP_{j,nZ}$ is taken. Before $p_j$ processes a message M with $s \ge V$, it is forced to take the checkpoint $CP_{j,lZ}$ and update its ordinal number counter to lZ. In other words, if M was sent after $CP_{i,lZ}$ was taken, it must be processed by $p_j$ after $CP_{j,lZ}$ is induced. Notice that all checkpoints $CP_{j,m}$ with $r < m < lZ$ become dummy checkpoints which overlap with $CP_{j,lZ}$.
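
The receiver-side check can be sketched as follows. This is a minimal illustration that tests the floor condition directly instead of maintaining the variable V; the class `LazyProcessor` and its method names are hypothetical, and saving state to stable storage is left out.

```python
class LazyProcessor:
    """Receiver-side logic of lazy checkpoint coordination (illustrative sketch)."""

    def __init__(self, pid, laziness):
        if laziness < 1:
            raise ValueError("laziness Z must be a positive integer")
        self.pid = pid
        self.laziness = laziness  # Z, known to all processors
        self.ordinal = 0          # current checkpoint ordinal number r
        self.basic = 0            # number of basic checkpoints taken
        self.induced = 0          # number of induced checkpoints taken

    def take_basic_checkpoint(self):
        """Independent (basic) checkpoint: advance the ordinal by one."""
        self.ordinal += 1
        self.basic += 1

    def tag(self):
        """Tag attached to outgoing messages: (sender number, sender ordinal)."""
        return (self.pid, self.ordinal)

    def before_processing(self, sender_ordinal):
        """Call before processing a message tagged with ordinal s.
        Returns True if a checkpoint with ordinal l*Z had to be induced."""
        l = sender_ordinal // self.laziness
        if l > self.ordinal // self.laziness:
            # Ordinals between r and l*Z become dummy checkpoints.
            self.ordinal = l * self.laziness
            self.induced += 1
            return True
        return False
```

With Z = 1 this reduces to inducing a checkpoint whenever the sender's ordinal is greater than the receiver's, which is the rule attributed to [35] in Section 1.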

In addition to the centralized garbage collection algorithm [37], a simple distributed algorithm can also be used for low-cost garbage collection. The basic idea is that if the current checkpoint ordinal number of every processor has exceeded nZ, all the checkpoints $CP_{j,m}$ with m < nZ become obsolete with respect to the recovery line consisting of $\{CP_{i,nZ} : 0 \le i \le N-1\}$ and therefore can be discarded. Each processor $p_j$ needs to maintain an array CP_progress[N] which records the highest ordinal number for every other processor known to $p_j$ based on the information included in each message. More efficient garbage collection can be achieved by periodically piggybacking the CP_progress[N] array on the normal messages in order to maintain the “transitive” knowledge of checkpointing progress of each processor [38].
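
A minimal sketch of this distributed garbage-collection rule is given below. The function names are hypothetical, the array is assumed to also hold the local processor's own ordinal, and "exceeded" is read strictly (every known ordinal greater than nZ), which is the conservative interpretation.

```python
def update_progress(cp_progress, sender, sender_ordinal):
    """Fold the ordinal number tagged on an incoming message into the local array."""
    cp_progress[sender] = max(cp_progress[sender], sender_ordinal)


def discard_bound(cp_progress, laziness):
    """Return n*Z for the largest n such that every known ordinal exceeds n*Z.
    Checkpoints with ordinal number m < n*Z can then be discarded."""
    lowest = min(cp_progress)
    n = max((lowest - 1) // laziness, 0)
    return n * laziness
```

Stale tags never lower an entry, so the bound only moves forward as messages arrive.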

Although the set of checkpoints $\{CP_{i,nZ} : 0 \le i \le N-1\}$ always forms a recovery line, the two-phase recovery procedure described in Section 2 should still be used to search for the most recent recovery line in order to minimize the number of rolled-back processors and the rollback distance. One possible optimization is that the dependency information corresponding to the garbage checkpoints, as determined from the CP_progress[N] array, need not be collected, thus reducing the size of the responses to the rollback-initiating message and the time for constructing the checkpoint graph.


4 Overhead Analysis 


Since the checkpoint overhead of the lazy checkpoint coordination scheme depends on the run-time dynamic communication pattern, it is important to analyze and estimate the potential extra overhead resulting from the induced checkpoints. We will first show that, without any constraints on the relative checkpointing progress of each processor, the worst-case induction ratio is (N - 1)/Z. Under certain conditions which are typically met by real applications, however, the upper bound on the induction ratio can be shown to be independent of N.

4.1 Worst-Case Analysis 

Our approach to worst-case analysis consists of two steps. First, given any fixed basic checkpoint 
pattern, we construct the worst-case communication pattern. Secondly, given any system with N 
processors, we derive the worst-case induction ratio as a function of N and the laziness Z. 
In this section, we assume each checkpoint $CP_{i,k}$ is associated with a global time stamp $t(CP_{i,k})$⁵. For any checkpoint and communication pattern $\mathcal{P}$, define $CP^{\mathcal{P}}_{*,nZ} = CP^{\mathcal{P}}_{i,nZ}$⁶ if $t(CP^{\mathcal{P}}_{i,nZ}) \le t(CP^{\mathcal{P}}_{j,nZ})$ for all $0 \le j \le N-1$, i.e., $CP^{\mathcal{P}}_{*,nZ}$ denotes the earliest checkpoint #nZ among all processors. Given any basic checkpoint pattern and the laziness Z, we construct the communication pattern $\mathcal{P}_0$ as follows. Suppose $CP^{\mathcal{P}_0}_{*,nZ} = CP^{\mathcal{P}_0}_{i,nZ}$. Then $p_i$ sends a message to every other processor and induces $CP^{\mathcal{P}_0}_{j,nZ}$ with $t(CP^{\mathcal{P}_0}_{j,nZ}) \approx t(CP^{\mathcal{P}_0}_{i,nZ})$ on processor $p_j$. Fig. 5(a) shows an example of $\mathcal{P}_0$ with Z = 2. We will call the interval between $t(CP^{\mathcal{P}_0}_{*,(n-1)Z})$ and $t(CP^{\mathcal{P}_0}_{*,nZ})$ the induction session #n, which includes all the induced checkpoints $CP^{\mathcal{P}_0}_{j,nZ}$. The following lemma will be used to prove that $\mathcal{P}_0$ is the worst-case communication pattern in terms of the induction ratio.

LEMMA 1 Given a basic checkpoint pattern, we have $t(CP^{\mathcal{P}_0}_{*,nZ}) \le t(CP^{\mathcal{P}}_{*,nZ})$ for arbitrary communication pattern $\mathcal{P}$ and any positive integer n.

Proof. The proof is given by induction on n. Since there cannot be any induced checkpoint before $t(CP^{\mathcal{P}}_{*,Z})$ for any $\mathcal{P}$, $t(CP^{\mathcal{P}}_{*,Z})$ only depends on the progress of taking basic checkpoints. Therefore, $t(CP^{\mathcal{P}_0}_{*,Z}) = t(CP^{\mathcal{P}}_{*,Z})$ and the case n = 1 is true. For the case n = k, suppose $CP^{\mathcal{P}}_{*,kZ} = CP^{\mathcal{P}}_{i,kZ}$. All the Z checkpoints $CP^{\mathcal{P}}_{i,l}$ with $(k-1)Z < l \le kZ$ must be basic checkpoints because they cannot be induced before $t(CP^{\mathcal{P}}_{*,kZ})$. Also, $t(CP^{\mathcal{P}}_{*,(k-1)Z}) \le t(CP^{\mathcal{P}}_{i,(k-1)Z}) < t(CP^{\mathcal{P}}_{i,l}) \le t(CP^{\mathcal{P}}_{i,kZ})$ by definition. Suppose the case n = k - 1 is true, i.e., $t(CP^{\mathcal{P}_0}_{*,(k-1)Z}) \le t(CP^{\mathcal{P}}_{*,(k-1)Z})$. We then have $CP^{\mathcal{P}}_{i,kZ} = CP^{\mathcal{P}_0}_{i,q}$ where $q \ge kZ$, because $t(CP^{\mathcal{P}_0}_{i,(k-1)Z}) \approx t(CP^{\mathcal{P}_0}_{*,(k-1)Z})$ by construction and there are at least Z basic checkpoints of $p_i$, i.e., the $CP^{\mathcal{P}}_{i,l}$’s, between $t(CP^{\mathcal{P}_0}_{*,(k-1)Z})$ and $t(CP^{\mathcal{P}}_{i,kZ})$. Finally,

$$t(CP^{\mathcal{P}_0}_{*,kZ}) \le t(CP^{\mathcal{P}_0}_{i,kZ}) \le t(CP^{\mathcal{P}_0}_{i,q}) = t(CP^{\mathcal{P}}_{i,kZ}) = t(CP^{\mathcal{P}}_{*,kZ})$$

and we have proved $t(CP^{\mathcal{P}_0}_{*,nZ}) \le t(CP^{\mathcal{P}}_{*,nZ})$ for all positive integers n. □

LEMMA 2 Given a basic checkpoint pattern, $\mathcal{P}_0$ is the worst-case communication pattern resulting
in the largest induction ratio.

⁵ This is only for the purpose of presentation.

⁶ We will use $CP^{\mathcal{P}}_{i,k}$ to denote the kth checkpoint of $p_i$ in the checkpoint and communication pattern $\mathcal{P}$. When it is
clear from the context that the basic checkpoint pattern is fixed, we also use the same notation for the communication
pattern $\mathcal{P}$.
Figure 5: (a) Worst-case communication pattern; (b) worst-case checkpoint and communication
pattern.

Proof. Let $I^{\mathcal{P}}_n$ denote the total number of induced checkpoints with ordinal number nZ for the communication pattern $\mathcal{P}$, and let $q = \max\{n : I^{\mathcal{P}}_n \ne 0\}$ be the maximum among n’s such that $I^{\mathcal{P}}_n \ne 0$. Clearly, $I^{\mathcal{P}}_n \le N - 1$. Since $t(CP^{\mathcal{P}_0}_{*,qZ}) \le t(CP^{\mathcal{P}}_{*,qZ})$ by Lemma 1, the checkpoint and communication pattern with $\mathcal{P}_0$ must consist of at least q induction sessions. Let $I^{\mathcal{P}}$ denote the total number of induced checkpoints for $\mathcal{P}$; we then have

$$I^{\mathcal{P}_0} \ge \sum_{1 \le n \le q} I^{\mathcal{P}_0}_n = q \cdot (N-1) \ge \sum_{1 \le n \le q} I^{\mathcal{P}}_n = I^{\mathcal{P}}.$$

Finally, because the number of basic checkpoints is fixed by the given basic checkpoint pattern, $\mathcal{P}_0$ has the largest induction ratio among all possible communication patterns. □

Lemma 2 states that, for worst-case analysis of the induction ratio, we need only consider the
communication pattern $\mathcal{P}_0$ for each basic checkpoint pattern. Because the induction sessions are
well-defined in such patterns (as shown in Fig. 5), the derivations can be simplified.

THEOREM 1 For any system with N processors and laziness Z, the induction ratio

$$\mathcal{R} \le \frac{N-1}{Z}.$$

Proof. For any basic checkpoint pattern with its corresponding $\mathcal{P}_0$ which results in L complete induction sessions, the number of induced checkpoints is $L \cdot (N-1)$. Let $B_n$ denote the number of basic checkpoints within the induction session #n. We have $B_n \ge Z$ for all $1 \le n \le L$ because the Z checkpoints $CP^{\mathcal{P}_0}_{i,l}$ with $(n-1)Z < l \le nZ$ cannot be induced checkpoints if $CP^{\mathcal{P}_0}_{*,nZ} = CP^{\mathcal{P}_0}_{i,nZ}$. Therefore, the induction ratio

$$\mathcal{R} = \frac{L \cdot (N-1)}{\sum_{1 \le n \le L} B_n + B_{L+1}} \le \frac{L \cdot (N-1)}{L \cdot Z} = \frac{N-1}{Z}.$$

□

Fig. 5(b) shows an example of the worst case for N = 3 and Z = 2. The stacked checkpoints indicate the fact that each dummy checkpoint $CP^{\mathcal{P}_0}_{j,2n-1}$ overlaps with the induced checkpoint $CP^{\mathcal{P}_0}_{j,2n}$. Since it takes exactly Z = 2 basic checkpoints to induce every N - 1 = 2 checkpoints, the induction ratio is (N - 1)/Z = 1.

4.2 The Upper Bound under Constraints 

The upper bound in Theorem 1 was derived under no constraints on the program behavior. Since 
it is of order O(N), the induction ratio may be unacceptably high for systems with a large number
of processors. However, a closer look at the two patterns in Fig. 5 reveals that the situation in (b)
which results in the worst-case induction ratio is less likely to happen for real applications where
the processors typically regularize their paces in taking basic checkpoints, as shown in (a). For 
example in Fig. 5(b), it is very likely for po to take at least one basic checkpoint between CP 
and CP^q. We can show that under the following constraints which are usually satisfied in real 
applications, the upper bound on the induction ratio is independent of N. 

Constraint 1: Let Q denote the ratio of the maximum to the minimum length of the basic checkpoint interval. Although each processor is allowed to take its basic checkpoints at its own pace, Q is typically bounded by a small constant. For example, Q is 2 or 3 for our experiments described in the next section.

Constraint 2: Only the cases with $Z \ge 2$ will be considered for refined upper bounds because the worst case for Z = 1 is always achievable even when Q is small (see Fig. 4(c)).

Constraint 3: The applications employing checkpointing and rollback recovery are usually long-running jobs, which implies $Z \cdot L$ is quite large. (Recall L is the number of complete induction sessions with $\mathcal{P}_0$.) In particular, we assume $Z \cdot L \gg \lceil Q \rceil$.

THEOREM 2 Under the above constraints, the induction ratio $\mathcal{R} < \lceil Q \rceil$.

Proof. Again we only have to consider $\mathcal{P}_0$ for each basic checkpoint pattern for the worst case. Let M denote the smallest integer such that $M \cdot (Z-1) \ge Q$. Since $Z \ge 2$ by Constraint 2, we have $M \le \lceil Q \rceil$. We define an M-induction session as consisting of M consecutive induction sessions. There are then $L_M = \lfloor L/M \rfloor$ complete M-induction sessions, each containing $M \cdot (N-1)$ induced checkpoints. We consider the following two cases.

(a) $N \le M$: By Theorem 1,

$$\mathcal{R} \le \frac{N-1}{Z} \le N-1 < N \le M \le \lceil Q \rceil. \qquad (2)$$

(b) $N > M$: First we consider the number of induced checkpoints I. If $Z \ge Q + 1$, then M = 1 and $I = L \cdot (N-1)$. If $Z < Q + 1$, $Z \cdot L \gg \lceil Q \rceil$ in Constraint 3 implies $L \gg \lceil Q \rceil$. Since $M \le \lceil Q \rceil$, we have $L/M \gg 1$ and

$$I = L_M \cdot M \cdot (N-1) + I_R \approx L_M \cdot M \cdot (N-1),$$

where the residual term $I_R$ is negligible. In either case, $I \approx L_M \cdot M \cdot (N-1)$.

Now consider the number of basic checkpoints B. For each induction session, the processor $p_i$ with $CP^{\mathcal{P}_0}_{i,nZ} = CP^{\mathcal{P}_0}_{*,nZ}$ must contribute Z basic checkpoints and therefore the length of each induction session is at least Z - 1 basic checkpoint intervals. Within each M-induction session, at least N - M processors do not have $CP^{\mathcal{P}_0}_{i,nZ} = CP^{\mathcal{P}_0}_{*,nZ}$ for any n. By the definition of Q, these N - M processors must each contribute at least $\lfloor M(Z-1)/Q \rfloor$ basic checkpoints. Therefore,

$$B \ge L_M \cdot \left( M \cdot Z + (N-M) \cdot \left\lfloor \frac{M(Z-1)}{Q} \right\rfloor \right)$$

and

$$\mathcal{R} = \frac{I}{B} \lesssim \frac{M \cdot (N-1)}{M \cdot Z + (N-M) \cdot \lfloor M(Z-1)/Q \rfloor}. \qquad (3)$$

Since $Z \ge 1$ and $\lfloor M(Z-1)/Q \rfloor \ge 1$ by definition, we have

$$\mathcal{R} \le \frac{M \cdot (N-1)}{M + (N-M)} < M \le \lceil Q \rceil. \qquad (4)$$

□


Notice that Eqs. (3) and (4) are still valid if we replace M with any m such that $M \le m \le \lceil Q \rceil$. By combining Theorem 1, Eq. (2) and Eq. (3), we then define the refined upper bound, called the Q-bound, as follows.

$$Q\text{-bound} = \min_{M \le m \le \lceil Q \rceil} \left\{ \frac{m \cdot (N-1)}{m \cdot Z + [N > m] \cdot (N-m) \cdot \lfloor m(Z-1)/Q \rfloor} \right\} \qquad (5)$$

where [N > m] = 1 if N > m is true and 0 otherwise.
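
To make the bound easier to experiment with, here is a small Python sketch. It assumes the Q-bound has the form obtained by combining Theorem 1 with Eq. (3), namely the minimum over $M \le m \le \lceil Q \rceil$ of $m(N-1) / (mZ + [N > m](N-m)\lfloor m(Z-1)/Q \rfloor)$, with M the smallest integer such that $M(Z-1) \ge Q$. The function names are hypothetical.

```python
import math


def worst_case_ratio(n_procs, laziness):
    """Upper bound of Theorem 1: (N - 1) / Z."""
    return (n_procs - 1) / laziness


def q_bound(n_procs, laziness, q):
    """Refined upper bound on the induction ratio (assumed form, see lead-in).
    Requires Z >= 2 (Constraint 2) and Q >= 1."""
    if laziness < 2:
        raise ValueError("the refined bound is only defined for Z >= 2")
    if q < 1:
        raise ValueError("Q is a max/min ratio and cannot be below 1")
    smallest_m = math.ceil(q / (laziness - 1))  # smallest M with M*(Z-1) >= Q
    best = math.inf
    for m in range(smallest_m, math.ceil(q) + 1):
        # Basic checkpoints contributed by the other N - m processors, if any.
        others = 0
        if n_procs > m:
            others = (n_procs - m) * math.floor(m * (laziness - 1) / q)
        best = min(best, m * (n_procs - 1) / (m * laziness + others))
    return best
```

For example, with N = 8, Z = 2 and Q = 2 the sketch evaluates to 1.4, compared with a worst-case ratio of 3.5 from Theorem 1.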

Fig. 6(a) compares the worst-case induction ratio with the Q-bound where Q = 2 for N = 8, 16 and 32. While the worst-case ratio (N - 1)/Z clearly grows with N, the Q-bound is relatively insensitive to N. Fig. 6(b) compares the worst-case induction ratio, which is equivalent to the Q-bound with Q = ∞, with the Q-bound where Q varies from 2 to 5. Since our purpose of introducing the Q-bound is to estimate the induction ratio for real applications in advance, the insensitivity of the Q-bound to the exact value of Q suggests that an approximate value of Q suffices for the estimation. Finally, notice that if Z is chosen to be at least Q + 1, we have $\mathcal{R} < M = 1$, i.e., the number of induced checkpoints will never exceed the number of basic checkpoints.


5 Experimental Results 

Four parallel programs written in the Chare Kernel language are used for the communication trace-driven simulation. The Chare Kernel has been developed as a medium-grain, machine-independent parallel language [39]. Programs written in the Chare Kernel language can run unchanged on both shared-memory and distributed-memory machines such as Encore Multimax, Sequent Symmetry, Intel iPSC/2 and i860 hypercubes and a network of Sun workstations. Program traces used in this paper are collected from an Encore Multimax 510.

Figure 6: (a) Worst-case induction ratio and the Q-bounds (Q = 2) for various N; (b) worst-case induction ratio (N = 32) and the Q-bounds for various Q.

The four programs include two newly developed CAD applications, Test generation and Logic 
synthesis, and two search applications, Knight tour and N queen. The execution times are between 
25 and 45 minutes (see Table 1). The total number of messages ranges from tens to hundreds of 
thousands. Our simulation uses the following scheme for inserting checkpoints. The predetermined 
minimum basic checkpoint interval is chosen to be 2 minutes. A variable Next_CP_Time is initialized
to 2 minutes. Each processor checks its local clock after processing every 100 messages. If the clock
time exceeds Next_CP_Time, a basic checkpoint is inserted and Next_CP_Time is incremented by
2 minutes. The resulting average basic checkpoint interval (CPI) for each program is listed in 
Table 1. Before processing each message, the processor also checks if an induced checkpoint and 
the corresponding update of the ordinal number counter are necessary, as described in Section 3. 
All reported numbers are averaged over five runs. 
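
The checkpoint-insertion rule used in the simulation can be sketched as follows. This is an illustration rather than the simulator itself: the function name and parameters are hypothetical, and `message_times` is assumed to hold the local clock reading (in seconds) at which each message finishes processing.

```python
def basic_checkpoint_times(message_times, min_interval=120.0, check_every=100):
    """Return the clock times at which basic checkpoints are inserted.

    The clock is consulted only after every `check_every` processed messages;
    a checkpoint is taken when the clock exceeds the next scheduled time."""
    next_cp_time = min_interval
    checkpoints = []
    for count, now in enumerate(message_times, start=1):
        if count % check_every == 0 and now > next_cp_time:
            checkpoints.append(now)
            next_cp_time += min_interval
    return checkpoints
```

Because the schedule advances by a fixed 2 minutes while the clock is sampled only at message boundaries, actual intervals vary around the nominal value, which is one reason the ratio Q between the longest and shortest interval is greater than 1.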

Table 1: Execution and checkpoint parameters of the Chare Kernel programs.

Programs                       Test generation   Logic synthesis   Knight tour   N queen
Number of processors           8                 6                 8             6
Execution time (sec)           2,076             1,736             2,436         -
Number of messages             28,219            411,733           104,170       25,880
Average number of basic
  checkpoints per processor    -                 -                 18.0          10.5
Average basic CPI (sec)        158               140               132           139
Q                              2.17              2.48              1.42          1.55
Under-2 percentage             99.6%             97.0%             100%          100%


We expect the variation of the basic checkpoint interval to be small because of the way it is maintained. In particular, we choose Q = 2 to estimate the induction ratio. The exact value of Q for each program is listed in Table 1. Although Q is slightly greater than 2 for the first two programs, the numbers listed in the row “Under-2 percentage” show that a very high percentage of the basic checkpoint intervals are covered by Q = 2. Also, since the Q-bound is insensitive to the exact value of Q, Q = 2 should suffice for our purpose. Fig. 7 plots the Q-bounds against the worst-case and the actual induction ratios for the four programs. It is shown that the Q-bound provides a good estimation of the induction ratio for real applications. The large difference in the ratio between Z = 1 and $Z \ge 2$ confirms that our generalization of the idea of communication-induced checkpoint coordination as described in [35] can significantly reduce the extra checkpoint overhead.

Figure 7: Checkpoint coordination overhead as a function of laziness: (a) Test generation; (b) Logic synthesis; (c) Knight tour; (d) N queen.

Figure 8: Average rollback distance (in number of basic CPIs) as a function of the laziness.

Fig. 8 gives the average rollback distances in terms of the number of average basic CPIs. The almost linear behavior can be explained as follows. Every N basic checkpoints $b_{i,k}$’s, $0 \le i \le N-1$, are taken at approximately the same time $t_k$. If any one of them, say $b_{i,k}$, is $CP_{i,nZ}$, then either $CP_{i,nZ}$ is consistent with $b_{j,k}$ or $CP_{j,nZ}$ is induced shortly due to the relatively large number of messages. Hence, a recovery line is formed around $t_k$. For Z = 1, that means the average rollback distance is at most 0.5 basic CPI and the exact value will depend on the offset between the $b_{i,k}$’s at run-time. For $Z \ge 2$, as long as some $CP_{i,nZ}$’s are induced before the $b_{i,k}$’s are initiated, the $b_{i,k}$’s become $CP_{i,nZ+1}$’s and one of the $b_{i,k+(Z-1)}$’s will become $CP_{i,(n+1)Z}$, which means a new recovery line will very likely exist around $t_{k+(Z-1)}$. Therefore, the average rollback distance is approximately (Z - 1)/2 basic CPIs as shown by the curve named “Estimated” in Fig. 8. It becomes clear that Figs. 7 and 8 provide a flexible trade-off between run-time overhead and recovery efficiency.


6 Concluding Remarks 


We have proposed the technique of lazy checkpoint coordination and incorporated it into the in- 
dependent checkpointing protocol as a mechanism for bounding rollback propagation. The recovery 
line is guaranteed to move forward by performing communication-induced checkpoint coordination 
only when the predetermined consistency criterion is about to be violated. The notion of laziness 
was introduced to provide the trade-off between extra checkpoint overhead during normal execution 
versus the average rollback distance for recovery. Overhead analysis shows that the upper bound on 
the induction ratio, i.e., the number of induced checkpoints divided by the number of basic check- 
points, is related to the maximum ratio between the basic checkpoint intervals. Communication 
trace-driven simulation results for several parallel programs showed that our analysis can provide a 
good estimation for the induction ratio, and lazy checkpoint coordination can significantly reduce 
the extra checkpoint overhead for real applications. 